ABSTRACT

Cursor tracking data contains information about website visitors
which may provide new ways to understand visitors and their needs.
This paper presents an Amazon Mechanical Turk study in which
participants were tracked as they used modified variants of the
Wikipedia and BBC News websites. Participants were asked to complete
reading and information-finding tasks. The results showed that it was
possible to differentiate between users reading content and users
looking for information based on cursor data. The effects of website
aesthetics, user interest and cursor hardware were also analysed: it
was possible to identify hardware from cursor data, but no
relationship between cursor data and engagement was found. The
implications of these results for web analytics and the design of
user engagement experiments are discussed.

1. INTRODUCTION

Internet-based businesses rely on a variety of measurements to
evaluate performance, improve services, target advertising and
personalise a website for its visitors. Analysis is generally based
on data such as page views, browsing history and search terms.
However, it is likely that more data can be gathered by examining
interactions such as cursor movements. Cursor tracking involves
recording the position of the cursor as a user interacts with a
webpage. Only a small number of existing analytic tools offer cursor
tracking, and those that do generally provide heatmaps of the data
showing the most common cursor positions. While this could be useful
when testing usability, it does not provide data that can be used
for user profiling.

Many researchers have argued that cursor tracking data can provide a
new way to learn about website visitors [8, 11]. Existing work shows
that cursor movements often correlate with gaze [6, 11, 22], which
suggests that some of the techniques employed in gaze tracking
studies could have analogues suitable for use with cursor tracking
data. If so, this could allow inexpensive and large-scale usability
testing to be carried out ‘in the wild’ using analysis methods that
could previously only be used in a lab-based setting. Other
researchers have suggested that cursor tracking data could reveal
user engagement [8], age and disability [21, 23], mental pressure
levels [24] and emotional state [15], and may even be able to
identify individual users much like a fingerprint [9]. This suggests
that cursor tracking data could be a valuable asset when profiling
website visitors.

*This work was carried out while undertaking an internship
at Yahoo Labs in Barcelona.


This paper presents an Amazon Mechanical Turk (MTurk) study that
asked users to complete tasks on live websites using their own
hardware in their natural environment. The aim of the study was to
explore how cursor tracking data might tell us more about the user
than could be measured using traditional means. The study explored
several metrics that might be used when carrying out cursor tracking
analyses, and used that data to demonstrate that the user’s hardware
and their browsing intent (searching for content vs. reading
content) could be predicted from cursor movements alone. The study
also revealed a number of important practical issues that need to be
addressed when attempting to use cursor-tracking data, such as how
the user’s hardware will affect the movements of the cursor.


The structure of this paper is as follows: background work is
presented in Section 2, followed by the design of the study and the
procedure used in Sections 3 and 4. The results of the study are
presented in Section 5 and are followed by a discussion of the
implications in Section 6. Finally, the findings are summarised and
avenues for further research are outlined in Section 7.


2. BACKGROUND 

One of the most difficult website performance metrics to measure
accurately is user engagement, generally defined as the amount of
attention and time a visitor is willing to spend on a given website
and how likely they are to return. Engagement is usually described
as a combination of various characteristics [4, 18, 20]. Attfield et
al. [2] discussed several characteristics of user engagement that
are difficult to measure, including aesthetics and novelty. These
would traditionally be measured using physiological sensors (e.g.
gaze tracking) or surveys. However, it may be possible to gather
this information through an analysis of cursor data.




• Can cursor tracking methods translate across websites? 


Most existing analytic tools focus on a visitor’s transition to and
from a webpage, using data such as the time spent on the page, the
referring page and the page the user went to next. Edmonds et al.
[8] argued that “within-page activity could inform Web designers
about the quality of the content on a page”, information that is
difficult to reliably extract using existing measurements. Edmonds
et al. argued for measuring engagement using tools such as
cursor-tracking, noting that there is considerable work correlating
eye movement and gaze [6, 11].

Huang & White [11] supported this idea and identified several
distinct cursor movements in existing work including “reading,
hesitating, highlighting, marking, and actions such as scrolling and
clicking”. Rodden et al. [22] identified several specific behaviours
that “seemed to indicate active use of the mouse to help the user
process the content of the search result page”, which were following
the mouse in the x and y directions and using the mouse to mark the
position of interesting results. Rodden’s work mostly confirms
earlier findings by Claypool et al. [7]. However, there was one
notable difference: Rodden et al. [22] suggested the user would
neglect the mouse while reading, whereas Claypool et al. [7]
suggested that users would frequently use the mouse to follow text
when reading. This work suggests that cursor movements, like eye
movements, will follow a distinct set of patterns and behaviours
that reflect the user’s activities.

Most existing work is focussed solely on the mouse as a hardware
device. However, research has shown that cursor movement properties
vary between hardware devices [1, 10], making it unclear if findings
for the mouse can be applied to other input devices. Other factors
that have been shown to affect cursor movements include age [21],
motor disability [23] and mental pressure levels [24]. These factors
may confound the task of interpreting cursor data, but could perhaps
be taken into account if known. Unfortunately it must be assumed
that nothing is known about the user in the context of online cursor
tracking. However, if the user’s interactions are observed, it may
be possible to predict some of the missing information using
statistical models.

3. STUDY DESIGN 

This section presents the design of a study that evaluated 
cursor tracking metrics ‘in the wild’, taking input hardware 
into account, in order to identify what they can reveal about 
the user. 

3.1 Research Questions 

The overall research goal was to identify if it is possible to 
measure user factors (such as interest in the content) solely 
from cursor data. To explore this, the study was designed 
to address the following research questions. 

• How does the browsing goal of the user (reading vs. 
searching) affect cursor movements? 

• Does the aesthetic appeal of a website affect cursor 
behaviour? 

• Does the user’s interest in the content affect cursor 
behaviour? 


• What is the impact of input hardware on cursor tracking?

• Can cursor behaviour be translated into user engagement
metrics?

3.2 Design 

The study was a mixed-model design with several groups. 
Two websites were used to ensure that the results would 
not be limited to a particular context. For each website two 
interfaces were created: one that would appear as normal 
and one that was intended to be aesthetically unappealing. 
Participants were asked to rate their interest in the website’s 
subject matter and were given tasks related to high-interest 
or low-interest subjects. As two websites were used in the 
study, there were eight groups in total. 

Once allocated to a group, participants were given two 
different tasks to carry out. These tasks were designed to 
promote different types of browsing behaviour: searching 
and reading. Participants carried out one of each type of 
task. Therefore, the independent variables of the experiment 
were: 

• website; 

• aesthetic appeal of the website; 

• predicted interest in the given task; 

• type of task carried out. 

The dependent variables of the experiment were the cursor tracking
data (gathered automatically) and engagement data (gathered by
survey).

The study was designed to promote a high level of ecological
validity. One of the ways this was achieved was by recruiting
participants through Amazon Mechanical Turk (MTurk), which allowed
participants to carry out the experiment in their natural
environment with their own hardware. Ecological validity was also
promoted by using fully interactive websites delivered through a
transparent proxy to encourage natural browsing.

This potentially introduced several confounding variables that were
not controllable, such as gender and hardware. One aim of the study
was to find out how such demographic data influenced the cursor
data, so demographic data was also gathered. To help ensure a
reliable analysis the groups were automatically counter-balanced by
gender. Other demographic data were gathered but not controlled.
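
The text does not describe how the automatic counter-balancing was implemented. The following is a minimal sketch of one possible approach, assuming each new participant is placed in the group that currently holds the fewest participants of the same gender. The class name, the group labels and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# The between-subject groups for one website:
# interface (normal/ugly) x predicted interest (high/low).
GROUPS = list(product(("normal", "ugly"), ("high", "low")))


class GenderBalancer:
    """Assign participants so each group receives a similar number per gender."""

    def __init__(self, groups=GROUPS):
        self.groups = list(groups)
        # counts[gender][group] -> number of participants assigned so far
        self.counts = defaultdict(lambda: {g: 0 for g in self.groups})

    def assign(self, gender):
        per_group = self.counts[gender]
        # Choose the group with the fewest participants of this gender.
        # Ties fall to the first group in the list (an assumption).
        group = min(self.groups, key=lambda g: per_group[g])
        per_group[group] += 1
        return group


balancer = GenderBalancer()
assignments = [balancer.assign(g) for g in ["f", "m", "f", "f", "m", "f", "m", "m"]]
```

With four participants of each gender, this rule places exactly one of each in every group; with uneven arrivals the groups differ by at most one participant per gender.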

3.3 Websites 

The websites used in the experiment were Wikipedia
(http://en.wikipedia.org) and BBC News (http://bbc.com). The
websites were chosen for several reasons:

• both promote a neutral point of view within their content;

• both have sufficiently sized articles and comparable 
types of content; 

• both provided search tools for finding content; and 


• both were large and popular websites at the time of the 
study (Wikipedia was ranked at #6 and BBC News at 
#81 by Alexa’s US traffic rankings). 

When selecting content, care was taken with both websites to avoid
contentious or provocative subjects which could affect the
participant’s mood. As Wikipedia’s content is user-generated, only
protected pages that did not allow editing were used in the
experiment.

The BBC website was not the first choice, but as the study was
carried out during the 2012 US presidential election, US news
websites were avoided so that participants would not be given tasks
that conflicted with their political beliefs. BBC News is generally
considered to be a relatively unbiased news website,¹ which was
particularly important as biased news could have had a considerable
effect on the study.

3.4 Aesthetics 

Two interfaces were created for the websites: one ‘normal’ and one
‘ugly’. The ‘normal’ version of the website was slightly modified by
removing a small number of interface elements which included login
forms, user comments, social media links and advertising. External
links would appear normal but would no longer function. Figures 1a
and 1c show that the standard interface was retained as much as
possible for both sites.

In creating the ‘ugly’ versions the aim was to make the 
sites aesthetically unappealing without drastically affecting 
the site’s usability or layout. This resulted in several changes 
as follows. 

3.4.1 Font 

The website fonts were replaced with Comic Sans for text and Impact
for headings. Comic Sans has been shown to be less aesthetically
pleasing than other common fonts; it has been widely ridiculed as a
poor font and there is a well-known movement attempting to raise
awareness of its misuse and shortcomings.² Impact does not face
similar criticisms, but the sharp lines and heavy weight of the font
created a contrast with Comic Sans.

3.4.2 Colour 

The colour scheme was changed to create a high contrast. Headings
were changed to be light blue, the background dark blue, normal text
yellow and hyperlinks red. Yellow is generally considered to be a
very poor choice for text as it is difficult to read. However, it
should remain readable given the high contrast with the blue
background. Style guides generally advocate no more than 2 or 3
colours for a website, so the use of 3 strong colours was expected
to produce low aesthetic scores.

3.4.3 Advertising 

A selection of banner advertisements in various standard sizes was
downloaded from the internet. These were then injected into pages in
appropriate areas such as the sidebar and between sections of text.
Around 25% of the banners were animated, resulting in an extremely
‘busy’ appearance.

1 The BBC was famously called a “Stateless Person’s Broadcasting
Corporation” by a prominent British politician during the Falklands
War, who felt the BBC coverage was “unpatriotic”.

2 Ban Comic Sans: http://bancomicsans.com

3.4.4 Navigation 

The term ‘mystery meat navigation’ refers to hiding navigation
elements until the cursor hovers over them.³ This prevents visitors
from seeing all options at once, which complicates and slows down
navigation. To avoid introducing a serious usability issue,
navigation links were coloured to provide minimal contrast with the
background, but would change to a high-contrast colour when the
cursor was positioned over them.

The result of these changes can be seen in Figures 1b and 1d. While
the ‘ugly’ versions of the websites at first appear to be
significantly different, the general layout of the pages remained
the same. A pilot study was carried out to make sure that the ‘ugly’
websites remained usable but were indeed aesthetically unappealing.
Testers reported no significant usability problems and confirmed
that the ‘ugly’ variants of the sites were visually unattractive.

3.5 Tasks 

Participants were asked to carry out two tasks: a task to promote
searching behaviour called the Search Task and another to promote
close reading called the Reading Task. Participants were asked to
rate their interest in each website’s content categories, which
allowed participants to be split into ‘high interest’ and ‘low
interest’ groups. Wikipedia had 12 content categories while BBC News
had 14.

3.5.1 Search Task 

The search task involved giving participants a quiz question which
could be answered using the website. Questions were selected where
answers could be found on both sites. This resulted in 62 questions
in total, 61 of which could be answered using Wikipedia and 53 of
which could be answered using BBC News. Participants were initially
placed on the website’s homepage and were expected to use the search
feature or navigation to find the answers. Care was taken to ensure
that suitable search terms were included when phrasing the
questions. Some examples of the questions were:

• What is ‘Geminoid F’? 

• In 2008 and 2009 three major car manufacturers withdrew
from Formula 1 citing rising costs. Which was
the first team to do so in late 2008?

• There are many different types of phobia. What is 
Gephyrophobia a fear of? 

3.5.2 Reading Task 

Participants were asked to read an article and write a “one or two
sentence” summary of the content in their own words. Some Wikipedia
articles were too long for this, so participants were asked to read
specific sections. A minimum of two articles was selected for each
subject, resulting in 24 Wikipedia articles and 28 BBC News
articles. Articles were chosen based on their length and to avoid
contentious subjects. The average word count was 862 (σ = 260) for
Wikipedia articles and 840 (σ = 174) for BBC News articles. For
example, some of the articles used were:

• Wikipedia - The Nile, Section 4.4: The search for
the source of the Nile

• Wikipedia - Archimedes, Section 1: Biography

• BBC News - Olympic torch: Torchbearer proposes
during relay

• BBC News - Particles point way for Nasa’s Voyager

3 Web Pages That Suck: http://www.webpagesthatsuck.com

Figure 1: Comparison of normal web pages with the ‘ugly’ variants.
The latter were modified by changing the colours to create a high
contrast, changing the fonts to Comic Sans and Impact, the injection
of advertising banners and the obfuscation of navigation elements.

3.6 Measurements 

Three core sets of measurements were taken during the 
study: cursor, engagement and demographic. 

3.6.1 Cursor Metrics 

Rodden et al. [22] identified 3 ways in which users could
use the cursor when reading web search results:

• The mouse could follow the eye horizontally; 

• The mouse could follow the eye vertically; 

• The mouse could ‘mark’ results. 

There is a range of things the cursor could do while the
user reads the page. The mouse following the eye horizontally
would likely be signified by a slow and steady horizontal
movement in the direction of the text followed by a ‘flick’
back to the next line. Following the text vertically would
result in a slow mouse movement in a downwards direction.
Marking the text, or interesting areas, would perhaps be
signified by long periods of inactivity in whitespace. Other
behaviours might also be expected [12], but as most existing
work is focussed on search engines, new behaviours could
be observed in the reading task. Existing work shows that
the most interesting areas of observation for cursor tracking
are the speed and frequency of movements, the number of
mouse clicks and the amount of time that the cursor lies
still. This led to the creation of the following 5 measures:

• Movement Speed (MS): Average speed over all movements
in pixels per second.

• Movement Rate (MR): Number of distinct movements
made per second.

• Click Rate (CR): Number of mouse clicks that were 
made per second. 

• Pause Length (PL): Average length of a pause in 
seconds. 

• Percentage of Time Still (PTS): Percentage of 
time where the cursor did not move. 

Cursor data was gathered as a series of x and y coordinates sampled
at a rate of 24 fps, as suggested by Leiva [14]. The high recording
rate allowed even very small pauses to be identified, which allowed
the data to be split into distinct segments. The segments were then
categorised as pauses, movements or scrolls. A pause happened when
the cursor was stationary and a movement occurred when the cursor
moved. If 99% or more of a movement was vertical then that movement
was reclassified as a scroll. As the websites had very different
vertical sizes, and as the advertisements injected into the ‘ugly’
websites increased their vertical size, vertical scroll movements
would be likely to confound the results and as such were not
included in the analysis.
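
The five measures can be illustrated with a short sketch. This is not the authors' implementation: it assumes the input is a list of (x, y) samples taken at a fixed 24 samples per second, that a segment is a maximal run of consecutive moving or stationary samples, that "99% vertical" means the share of vertical travel in a movement, and that rates are taken over the whole recording. The function name `cursor_metrics` is hypothetical.

```python
import math
from itertools import groupby

SAMPLE_RATE = 24.0  # samples per second, as stated in the text


def cursor_metrics(samples, clicks, rate=SAMPLE_RATE):
    """Compute MS, MR, CR, PL and PTS from fixed-rate (x, y) samples."""
    dt = 1.0 / rate
    # One step per pair of consecutive samples: (dx, dy).
    steps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(samples, samples[1:])]
    total_time = len(steps) * dt
    if total_time == 0:
        raise ValueError("need at least two samples")

    pauses = []     # pause durations in seconds
    movements = []  # (distance in pixels, duration in seconds)
    # Maximal runs of moving or stationary steps form the segments.
    for moving, run in groupby(steps, key=lambda s: s != (0, 0)):
        run = list(run)
        duration = len(run) * dt
        if not moving:
            pauses.append(duration)
            continue
        horizontal = sum(abs(dx) for dx, _ in run)
        vertical = sum(abs(dy) for _, dy in run)
        # Assumed reading of the 99% rule: share of vertical travel.
        if vertical / (horizontal + vertical) >= 0.99:
            continue  # treated as a scroll and left out of the metrics
        distance = sum(math.hypot(dx, dy) for dx, dy in run)
        movements.append((distance, duration))

    return {
        "MS": (sum(d / t for d, t in movements) / len(movements)
               if movements else 0.0),
        "MR": len(movements) / total_time,
        "CR": clicks / total_time,
        "PL": sum(pauses) / len(pauses) if pauses else 0.0,
        "PTS": 100.0 * sum(pauses) / total_time,
    }
```

One consequence of this segmentation is that a scroll immediately followed by a horizontal movement is treated as a single segment; a fuller implementation would probably also split segments on changes of direction.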

3.6.2 Engagement 

Attfield et al. [2] defined engagement as “the emotional, cognitive
and behavioural connection that exists, at any point in time and
possibly over time, between a user and a resource”. O’Brien & Toms
[16] defined engagement as “characterized by attributes of
challenge, positive affect, endurability, aesthetic and sensory
appeal, attention, feedback, variety/novelty, interactivity, and
perceived user control”. Engagement is a desirable property for
websites, as engaged users are (1) likely to view more content and
spend more time on a website and (2) more likely to return to the
website at a later date. Engagement is difficult to measure,
however, and is normally assessed using eye tracking methods.
Studies have shown that gaze and cursor often correlate, so it may
be possible to determine engagement from cursor movements.

O’Brien & Toms [17] created a survey to measure user engagement,
defining it in terms of attention, usability, aesthetics,
endurability and novelty. Attfield et al. [2] built on this work and
argued that the relevant factors were attention, affect, aesthetics,
endurability, novelty, control, reputation and user context. For
some of these factors there are well-known surveys that can be used
for assessment, such as the PANAS scale for measuring affect [25].

In this study engagement was measured using two tools. Firstly, the
PANAS [25] was used to measure positive and negative affect.
Secondly, focussed attention, perceived usability, novelty,
aesthetics and involvement were measured based on the work of
O’Brien & Toms [17]. The wording of their survey components was
modified to suit the study by removing the shopping-task context.

3.6.3 Demographic Data 

Demographic data was gathered to identify any interesting effects.
All participants were required to be US residents to participate in
the study. Participants were asked to provide their age, gender,
handedness, computer experience, media consumption habits and input
hardware. To help participants identify their input hardware,
pictures of various devices were provided, as shown in Figure 2.

3.7 Implementation 

Implementing the study required two components: a proxy server to
manipulate webpages and a cursor tracking system. The cursor
tracking software used was the open-source system Simple Mouse
Tracking 2 (smt2), created by Leiva [14].

Figure 2: The input hardware question as presented in the study.

The proxy was implemented in PHP and was transparent to the user
(i.e. participants did not have to change their browser settings or
install software). The proxy would take a standard HTTP GET request
containing the address of the target website and any additional
parameters to include in the query. We called this proxy a proxy
trap as one of its primary functions was to ensure that visitors
could not leave the original site by clicking links. When the proxy
was called it carried out the following actions:

1. Reconstruct the target URL from the HTTP request. 

2. Fetch the target page using PHP cURL. 

3. Strip out any ‘banned’ elements, such as HTML <div>
elements containing social media links or JavaScript
that loaded dynamic content.

4. Remove the target of anchors which led to external 
sites and redirect internal links to point back to the 
proxy (preventing users from leaving the proxy). 

5. Append an overriding stylesheet to create the normal/ugly
interfaces.

6. Insert JavaScript that enabled the smt2 tool.

7. Insert an HTML element at the top of the webpage
which was used to send messages or links to the participant
regarding the study.

8. Send the resulting page to the participant’s browser. 
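
The link-rewriting and injection steps (4 to 6) can be sketched as follows. The original proxy was written in PHP and its code is not given, so this Python sketch is only an illustration; the proxy address, the `url` query parameter and both function names are hypothetical. Only anchor elements are rewritten, so other resources keep loading from the original source.

```python
import re
from urllib.parse import quote, urljoin, urlparse

PROXY_URL = "https://study.example.org/proxy.php"  # hypothetical address

# Matches the href attribute of <a> elements only (double-quoted values).
ANCHOR_HREF = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")')


def rewrite_links(html, page_url, proxy_url=PROXY_URL):
    """Route same-site links back through the proxy and disable external ones."""
    site = urlparse(page_url).netloc

    def replace(match):
        target = urljoin(page_url, match.group(2))  # resolve relative links
        if urlparse(target).netloc == site:
            new_href = "%s?url=%s" % (proxy_url, quote(target, safe=""))
        else:
            new_href = "#"  # external link: still rendered, but goes nowhere
        return match.group(1) + new_href + match.group(3)

    return ANCHOR_HREF.sub(replace, html)


def inject_assets(html, stylesheet_url, tracker_url):
    """Add an overriding stylesheet and the tracking script before </head>."""
    tags = ('<link rel="stylesheet" href="%s">'
            '<script src="%s"></script>' % (stylesheet_url, tracker_url))
    return html.replace("</head>", tags + "</head>", 1)
```

A regular expression is used here for brevity; a production proxy would more likely use an HTML parser so that single-quoted or unquoted attributes are also handled.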

The advantage of this system was that only the text of a 
given website would pass through the proxy, leaving images 
and scripts to load from the original source. This resulted 
in good performance with no noticeable lag when comparing 
the proxy to the original website. 

4. PROCEDURE 

Amazon Mechanical Turk (MTurk) was used to carry out the experiment.
The study was split into two parts, with Wikipedia carried out first
(n = 160) and BBC News second (n = 165). Work units were placed on
MTurk with a unique code. MTurk workers would read the description
of the study, accept the work unit, then click a link to the study
website where they would input their unique code. Workers were
informed that they would not be paid if they participated in the
study multiple times.

At the start of the study participants were asked to verify their
consent with regard to the data that would be gathered. Following
this the participant would be asked their gender, which was used to
balance the gender distributions in each group.⁴ The 4 groups for
each website varied in terms of task interest and website
aesthetics. Participants were then asked to fill out a survey based
on their interests relative to the website’s content. Depending on
their group, they were then given high-interest or low-interest
content.

Before the first task the participant was asked to fill out a PANAS
questionnaire to assess their initial affect. The two tasks (search
and reading) were delivered in a random order. Both followed the
same format, which started by describing the task then directing the
participant to the proxy. A banner at the top of the website would
provide a reminder of the task and included a button to press when
the participant was finished. When pressed the button would take the
participant to a page where they could type in their answer. For the
reading task a word counter was provided that would limit the amount
participants could write by turning green after 10 words and red
after 50 words. With each task completed, participants were
administered another PANAS followed by the engagement survey. The
questions in both surveys were randomly ordered for every
participant. At the end of each task participants were given the
option to provide comments or feedback if they wished. Finally,
participants were asked to fill out the demographic survey.

With the work completed, participants were then given a second code
to insert back into MTurk to complete the work unit. This process
took around 15-25 minutes per participant and each participant was
paid $2.50. The study was split up into batches of 80 participants
which required around 2 hours each to complete.

4.1 Participant Validation 

To verify that participants had correctly carried out their 
work two methods were employed. Firstly, the response 
codes entered by participants to complete the work were 
compared to the codes originally provided to them. Any 
codes which did not match suggested that the participant 
had simply entered a random code to complete the work 
without taking the study. Only 1 participant did this and 
their work unit was simply rejected and returned to MTurk 
for another worker to complete. 

The second stage of verification was to manually check participants’
answers to both the search and reading tasks. This was
labour-intensive but revealed that the majority of the MTurk workers
completed their tasks to a high standard. Of 325 workers, only 1
participant’s work was considered so poor that it had to be
rejected. Although it was not required, a large number of
participants left comments on their experience.

The IP addresses of workers were also checked, which revealed two
duplicates. While this could be people with multiple accounts, it
could also show people sharing an internet connection and was not
considered suspicious.
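
The two automatic checks described in this section (completion codes and shared IP addresses) could be expressed along the following lines. This is a sketch under assumed data structures: the record fields `worker_id`, `code` and `ip`, and the function name, are hypothetical and not taken from the study's software.

```python
from collections import Counter


def validate_submissions(issued_codes, submissions):
    """Return workers whose code does not match, and IPs seen more than once.

    issued_codes: dict mapping worker id -> completion code given to them.
    submissions: list of dicts with 'worker_id', 'code' and 'ip' keys.
    """
    # A mismatch suggests a random code was entered without taking the study.
    rejected = [s["worker_id"] for s in submissions
                if issued_codes.get(s["worker_id"]) != s["code"]]
    # Shared addresses are only flagged for inspection, not rejected.
    ip_counts = Counter(s["ip"] for s in submissions)
    shared_ips = sorted(ip for ip, n in ip_counts.items() if n > 1)
    return rejected, shared_ips
```

The manual review of answers described above is not automated here, since it relied on human judgement of answer quality.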

5. RESULTS 

The study examined the effects of four independent variables (task,
website, predicted interest and interface version) on two sets of
dependent variables (cursor movement and engagement).

4 On average, each Wikipedia group had 18 males and 22
females and each BBC group had 22 males and 19 females.

5.1 Cursor Movements 

The five cursor metrics measured were analysed in turn using a
multi-factorial ANOVA where the within-subjects factor was task and
the between-groups variables were website, interface version and
predicted interest. Participant input hardware was included as a
covariate. Note that the results showing the effects of hardware on
the model are presented later in this section.

5.1.1 Movement Speed 

No effects were found of task, website, interface version or 
predicted interest on MS and no interactions were found. 

5.1.2 Movement Rate 

Task was found to have a significant effect on MR (F(1, 244) =
18.24, p < .001, ω² = .07), but no other effects or interactions
were found.

5.1.3 Click Rate 

No effects were found of task, website, interface version or 
predicted interest on CR and no interactions were found. 

5.1.4 Pause Length 

Website was found to have an effect on PL (F(1, 244) = 3.97,
p < .05, ω² = .02), while no effects were found for task, predicted
interest or interface. Task type and predicted interest interacted
to affect PL (F(1, 244) = 3.97, p < .05, ω² = .02). No other
interactions were found.

5.1.5 Percentage of Time Still 

Task was found to affect PTS (F(1, 244) = 29.57, p < .001,
ω² = .11), but no effects were found for website, interface or
predicted interest and no interaction effects were found.

The results suggest that the two interfaces made little difference
to the cursor movements. Predicted interest had no main effect, but
interacted with task type to affect PL. Website only had an effect
on PL. The type of task affected both MR and PTS, but was not found
to affect the other cursor metrics.

The ability to distinguish between searching and reading behaviour
could be very useful in understanding website visitors. Logistic
regression⁵ can be used to evaluate if the user’s activity can be
predicted from their cursor movements. The variables with the
strongest effects were MR and PTS. However, the model produced by
these variables was not found to be a good fit for the data. Of the
remaining three variables, only PL was found to correlate with task
type (r(550) = −.33, p < .001). The collinearity of MR, PTS and PL
was found to be acceptable (VIF = 3.22) and so all three variables
and their interactions were used to carry out a binary logistic
regression with the backwards-stepwise LH method. The resulting
model is shown in Table 1 and was able to correctly identify user
activity in 74.5% of cases.

5 For information on logistic regression see Peng et al. 


Table 1: Binary Logistic Regression Analysis of Task using
MR, PL and PTS.

               β           SE β    W       e^β
Constant       11.59***    2.89    16.13   NA
MR            -10.33***    2.15    23.07   0.00
PL             -7.93**     3.00     7.00   0.00
PTS           -13.49***    3.08    19.24   0.00
PL*PTS          7.71*      2.99     6.65   2238.83
MR*PTS         13.90***    2.72    26.07   1082822.66
PL*MR*PTS       3.89**     1.29     9.12   49.00

Note: Model produced after 2 iterations, 1 interaction
removed. Model: χ²(6) = 186.82, p < .001. Cox & Snell
R² = .29, Nagelkerke R² = .38. Hosmer-Lemeshow
goodness-of-fit test: (χ²(8) = 9.04, p = .34). Sig: [*] = p <
.05, [**] = p < .01, [***] = p < .001. [e^β] = Odds Ratios,
[W] = Wald’s χ².
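To make the fitted model concrete, the coefficients in Table 1 can be plugged into the logistic function to turn a set of MR, PL and PTS values into a predicted probability. The following is a minimal sketch and not the authors' code. It assumes the predictors are supplied on the same scale used when the model was fitted (the scaling is not stated in this section), and it makes no assumption about which task is coded as the positive class. All function and variable names are illustrative.

```python
import math

# Coefficients (beta) copied from Table 1.
BETA = {
    "const": 11.59,
    "MR": -10.33,
    "PL": -7.93,
    "PTS": -13.49,
    "PL*PTS": 7.71,
    "MR*PTS": 13.90,
    "PL*MR*PTS": 3.89,
}


def linear_predictor(mr: float, pl: float, pts: float) -> float:
    """Log-odds of the positive class for one observation."""
    return (
        BETA["const"]
        + BETA["MR"] * mr
        + BETA["PL"] * pl
        + BETA["PTS"] * pts
        + BETA["PL*PTS"] * pl * pts
        + BETA["MR*PTS"] * mr * pts
        + BETA["PL*MR*PTS"] * pl * mr * pts
    )


def predicted_probability(mr: float, pl: float, pts: float) -> float:
    """Logistic transform of the linear predictor (numerically stable form)."""
    z = linear_predictor(mr, pl, pts)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def odds_ratio(beta: float) -> float:
    """Odds ratio for a one-unit increase in a predictor: exp(beta)."""
    return math.exp(beta)
```

The odds-ratio column of the table is exp(β); for example, `odds_ratio(3.89)` is approximately 49, in line with the last row. Classifying an observation then amounts to comparing `predicted_probability(...)` with a cut-off, commonly 0.5.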

5.2 Engagement Data 

To examine the effects of the four factors on the engagement
data, each engagement component was taken in turn
and analysed with the multi-factorial ANOVA.

5.2.1 Affect 

No main effects were found from any of the factors on
positive or negative affect. However, an interaction effect
of task type and predicted interest was found on positive
affect (F(1, 244) = 4.35, p < .05, ω² = .02). The data suggest
that positive affect increased during the reading task when
participants were given subjects that matched their predicted
interest levels.

5.2.2 Focussed Attention 

An interaction effect between predicted affinity, interface
version and task type was found (F(1, 244) = 4.03, p <
.05, ω² = .02), but no other effects or interactions were
found.

5.2.3 Perceived Usability 

Task type affected perceived usability (F(1, 244) = 5.15,
p < .05, ω² = .02). Interface version, predicted interest and
website were not found to have an effect. Task type and
interface version created an interaction effect on perceived
usability (F(1, 244) = 4.37, p < .05, ω² = .02). Task type,
interface version and website interacted to affect perceived
usability (F(1, 244) = 5.97, p < .05, ω² = .02).

5.2.4 Aesthetics 

Interface version (normal vs. ugly) was not found to affect
the aesthetic scores, but website was found to have an effect
(F(1, 244) = 20.47, p < .001, ω² = .08). Task type and
predicted interest were not found to affect aesthetic scores,
but together created an interaction effect (F(1, 244) = 4.6,
p < .05, ω² = .02). No other interactions were found.

5.2.5 Novelty 

None of the factors were found to affect novelty. 

5.2.6 Involvement 

None of the factors were found to affect involvement. 
Table 2: Participant cursor hardware.

              Wikipedia   BBC News
Mouse         110         94
Trackpad      45          68
Touchscreen   4           3
Trackball     1           0


There were two surprising results. Firstly, the two interface
versions were not found to have significantly different
aesthetics scores, yet the two websites were, with a surprising
effect size (ω² = .08). Secondly, the participant’s predicted
interest did not appear to influence their affect ratings, which
would be expected if they were actually given more or less
interesting tasks. However, some effect was found when
considering the reading task.

5.3 Demographic 

The data was checked for correlations between the gathered
demographic data and various cursor movement data
as described in Section 3.6. All correlations were carried out
over the whole data set controlling for task, website,
interface version and task ordering. The interface version must
be included as a control variable as the addition of
advertisements on the ‘ugly’ webpages resulted in more vertical
space and therefore more vertical scrolling.

There were considerable differences in speed between
participants. While we checked computer experience, the speed
of participants did not correlate strongly with this;
interestingly, speed correlated strongly with the age of the
participant instead.

To address this, the movements were divided by the time,
giving us time-independent data. The data was then divided
by the total number of movements to provide the percentage
of movements in a minute, accounting for the differences in
activity rate between participants. The resulting data
represents the percentage of each type of cursor movement a
participant made over an average minute.
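The two-step normalisation described above can be sketched as follows. This is an illustration under assumptions: the movement categories (`up`, `down`, `left`, `right`) and the function name are hypothetical, since the movement types themselves are defined elsewhere in the paper.

```python
def normalise_movements(counts: dict[str, int], duration_minutes: float) -> dict[str, float]:
    """Convert raw movement counts into percentage shares per average minute.

    Step 1: divide each count by the session duration (movements per minute).
    Step 2: divide each rate by the total rate and express it as a percentage.
    """
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    per_minute = {kind: count / duration_minutes for kind, count in counts.items()}
    total_rate = sum(per_minute.values())
    if total_rate == 0:
        # No movement recorded: every share is zero.
        return {kind: 0.0 for kind in counts}
    return {kind: 100.0 * rate / total_rate for kind, rate in per_minute.items()}


if __name__ == "__main__":
    # Hypothetical participant: 100 movements over a two-minute session.
    shares = normalise_movements({"up": 10, "down": 30, "left": 40, "right": 20}, 2.0)
    print(shares)
```

Because every count is divided by the same duration, the final percentages depend only on the mix of movements, which is what makes slow and fast participants comparable.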

5.4 Predicting Engagement 

Cursor data were checked for correlations with the engagement
data, but no correlations were found. The tests were
re-run to control for any variance caused by cursor hardware,
task and website; however, there still appeared to be
no correlation between the cursor metrics tested and the
subjective survey responses of the participants.
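One common way to control for a variable when correlating two measures is a partial correlation. The sketch below shows the first-order case (a single numeric control variable) using only the Python standard library. It is an assumption that a comparable procedure was used here; controlling for several categorical factors at once, as in this study, would in practice be done with a statistics package.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs: list[float], ys: list[float], zs: list[float]) -> float:
    """Correlation between xs and ys with the linear effect of zs removed."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("control variable fully determines xs or ys")
    return (r_xy - r_xz * r_yz) / denominator
```

Here `xs` might hold a cursor metric, `ys` an engagement score and `zs` a control such as session length; all three are placeholders for illustration.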

5.5 Hardware 

Table 2 shows that most participants used either a mouse
or trackpad, and due to the small amount of data for other
hardware types, the analysis was carried out using only
mouse and trackpad data. Included as a co-variate in the
model described in the ‘Cursor Movements’ section,
hardware was found to affect:

5.5.1 Movement Speed

Input hardware was found to affect MS (F(1, 244) = 39.19,
p < .001, ω² = .14).

5.5.2 Movement Rate


Table 3: Binary Logistic Regression Analysis of Input
Hardware using MS, MR and CR.

             β            SE β     W       e^β
Constant     -1.255***    0.268    21.86   NA
MR           -1.758***    0.446    15.57   0.17
MS            0.001***    0.000    17.69   1.00
MR*CR         0.003**     0.001    10.57   1.00
MS*CR        -1.255***    0.000     8.21   1.00

Note: Model produced after 4 iterations, 1 variable and 2
interactions removed. Model: χ²(4) = 133.67, p < .001. Cox
& Snell R² = .22, Nagelkerke R² = .30. Hosmer-Lemeshow
goodness-of-fit test: (χ²(8) = 8.25, p = .41). For p values,
[**] = p < .01, [***] = p < .001. [e^β] = Odds Ratios,
[W] = Wald’s χ².

Input hardware was found to affect MR (F(1, 244) =
26.61, p < .001, ω² = .10).

5.5.3 Click Rate 

Input hardware was found to affect CR (F(1, 244) = 47.53,
p < .001, ω² = .16).

5.5.4 Pause Length 

Input hardware was found to affect PL (F(1, 244) = 17.75,
p < .001, ω² = .07). Task was found to interact with
hardware to affect PL (F(1, 244) = 11.52, p < .01, ω² = .05).

5.5.5 Percentage of Time Still

Input hardware was found to affect PTS (F(1, 244) =
23.39, p < .001, ω² = .09).

These results show that input hardware has a strong
effect on the behaviour of the cursor. Additionally, task and
hardware interacted to affect PL. It would be useful if
input hardware could be determined from cursor movements.
This can be tested with binary logistic regression. Both PL
and PTS were found to correlate quite strongly with MR,
so neither could be included in the regression. The results
showed that MS, MR and CR had the strongest effects and
a low degree of collinearity (VIF = 1.07), so they were used
to carry out a binary logistic regression using the backwards
stepwise LH method. Table 3 shows the resulting model,
which is able to correctly identify input hardware in 75.5%
of cases.
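The variance inflation factor quoted for both models has a direct interpretation: VIF = 1 / (1 − R²), where R² is the proportion of variance in one predictor that is explained by the other predictors. As a worked illustration (the helper names are hypothetical), the reported VIF values can be converted back to the R² they imply.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """Variance inflation factor for a predictor with the given R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif: float) -> float:
    """R^2 (variance shared with the other predictors) implied by a VIF."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


if __name__ == "__main__":
    # VIF values as reported for the two regressions in this paper.
    for label, vif in [("task model (MR, PTS, PL)", 3.22),
                       ("hardware model (MS, MR, CR)", 1.07)]:
        print(f"{label}: VIF = {vif:.2f} implies R^2 = {r_squared_from_vif(vif):.3f}")
```

Under this reading, a VIF of 3.22 corresponds to roughly 69% shared variance among the task predictors, whereas 1.07 corresponds to under 7% for the hardware predictors.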

6. DISCUSSION 

There were two important findings from this study. Firstly,
the study revealed some of the differences in cursor
movements between users searching and users reading. Secondly,
the study identified some of the differences between the
cursor behaviours of a mouse and a trackpad. The regressions
provided in this paper show that both the user’s activity
(reading or searching) and input hardware (mouse or
trackpad) can be predicted from the behaviour of the cursor. This
could have several interesting applications in a range of
domains.

An obvious application is to form a better understanding 
of website guests. The ability to differentiate between a user 
searching and reading could have benefits in several areas, 










Figure 3: Graphs showing the effects of input hardware and task on the cursor metrics MS, MR, CR, PL and PTS. 


such as when carrying out ‘in the wild’ usability studies.
For example, if using cursor data to evaluate the interface
using the methods described by [3], tracking data could be
split into ‘searching’ and ‘reading’ to provide a better
understanding of the user’s goals, allowing usability issues that
affect one type of user to be isolated.

This information could also be useful as a part of content 
management; Wikipedia for example would be expected to 
have a large number of guests who are both searching and 
reading. Wikipedia promotes certain articles on its front 
page, but given two pages with equal traffic, which one 
should be promoted? If the traffic to both pages could be 
split into readers and searchers then the page with the most 
readers would likely be the page to promote. This could 
also be used to promote ‘searched’ pages above ‘read’ pages 
on the site’s search engine. Another example is that
shopping sites could direct advertising towards products based
on other products that guests took time to read, instead of 
assuming that all previously-viewed products were equally 
indicative of the type of products that the customer wants 
to buy. 
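The promotion rule suggested above (given comparable traffic, prefer the page with more readers) could be sketched as below. The visit records, page names and `activity` labels are hypothetical; in practice the labels would come from a classifier such as the one summarised in Table 1.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def page_to_promote(visits: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the page with the most visits labelled 'reading'.

    `visits` is an iterable of (page, activity) pairs, where activity is
    either 'reading' or 'searching'. Ties are broken alphabetically so the
    result is deterministic. Returns None when no reading visits exist.
    """
    readers = Counter(page for page, activity in visits if activity == "reading")
    if not readers:
        return None
    return min(readers, key=lambda page: (-readers[page], page))


if __name__ == "__main__":
    # Two hypothetical pages with equal traffic but a different reader mix.
    log = [
        ("Page A", "reading"), ("Page A", "searching"), ("Page A", "searching"),
        ("Page B", "reading"), ("Page B", "reading"), ("Page B", "searching"),
    ]
    print(page_to_promote(log))
```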

Identifying the input hardware of users is an important
step in attempting to interpret cursor data as the results
of this work have clearly shown that input hardware has
an effect on other measurements. The ability to identify
cursor hardware will be highly important to making reliable
observations from cursor data. Assuming that the findings
presented here can be extended to detect other types of
hardware, in particular touch screens or stylus interactions,
then it could also have interesting applications in website
design. While the results of this study suggest that the
mouse and trackpad are the most common input devices,
touchscreen technology is becoming increasingly common.
Touchscreen technology renders the traditional cursor
somewhat obsolete: the finger becomes the cursor, and as such
common website design elements such as drop-down menus
and hover-to-enlarge images might not be suitable ways to
interact. The ability to identify the user’s hardware could
be useful in the same way that developers would previously
monitor browser versions; although in this case it would not


Table 4: Summary of comments left by participants
regarding the interface. Total includes comments that did not
mention the interface.

            Wikipedia         BBC News
Comments    Normal   Ugly     Normal   Ugly
Positive    0        1        0        0
Neutral     1        0        0        0
Negative    1        30       0        23
Total       61       70       44       51


be to provide workarounds for browser-specific bugs, but to 
provide a slight variant of the website that is optimised for 
the user’s hardware to provide more natural interactions. 

The most surprising result from the study was that the 
‘ugly’ variants of the sites did not result in lower aesthetics 
scores, yet the BBC News website was considered to be more 
aesthetically pleasing than Wikipedia. This finding is made 
even more unusual when participant comments are taken 
into consideration: 

• [Wikipedia] “The website was simply awful. Ads flashing
everywhere, poor text colors on a dark blue
background.”

• [Wikipedia] “The webpage was entirely blue. I don’t 
know if it was supposed to be like that, but it definitely 
detracted from the browsing experience.” 

• [Wikipedia] “Only that it has slightly interested me as 
to why the site made it intentionally difficult.” 

• [BBC News] “The font and the colors used were really 
distracting to me.” 

• [BBC News] “Multiple font colors within the same page 
= hard to read.” 

• [BBC News] “The website’s layout and colour scheme 
were a bitch to navigate and read.” 

• [BBC News] “Comic sans is a horrible font.”












































Participants were not prompted to discuss the interface
in any way; the comment box was simply titled “Comments
(optional)”. Yet as Table 4 shows, a large number of
participants left negative comments about the ‘ugly’ versions of
the interface. While a single person “loved the colours”, this
may have been sarcasm. These comments suggest that the
aesthetics scores do not accurately reflect the aesthetics of
the interfaces. Further work would be needed to fully
explore this unexpected result.

There was no correlation between the predicted interest
in the task and a participant’s reported interest in the task.
There were also no significant correlations between any of
the cursor metrics and the engagement data from the
surveys. While these results are not suspect, the lack of
correlation between aesthetic scores and user comments
creates questions about the reliability of the engagement
survey data. Due to the between-groups design of the
experiment, participants had no basis of comparison for the
majority of the engagement metrics, which could have
impacted on their answers. Another possible source of
interference is the Hawthorne effect; participants were notified
that their cursor was being tracked, which could have
influenced both their cursor movements and the engagement
measures. While it is possible that MTurk users might have
‘randomly’ clicked through the survey, we do not believe this
was the case; the manual result verification showed that all
questions were answered to a high standard, and the
majority of participants (70%) took extra time to answer optional
parts of the survey. Therefore, we believe that the mouse
tracking data is sound but there is a possibility that the
engagement data is not reliable. Further studies would be
needed to explore this finding.

If the engagement data is reliable, then the results suggest
there is no relationship between these cursor and
engagement measures. Yet these measures represent only a fraction
of the data that can be extracted from the cursor;
alternative metrics are likely to provide even more information. To
further investigate ways to predict engagement from cursor
data, one approach might be to apply statistical clustering
methods to cursor data streams. This could reveal specific
types of movement in the same way that gaze tracking
studies observe saccades and fixations. Recent work by Arapakis
et al. [2] suggests this is the case, as they were able to
correlate certain activities with subjective engagement metrics,
specifically focussed attention and affect.
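As one illustration of what clustering a cursor stream might look like, the sketch below groups the speeds between consecutive cursor samples into a slow and a fast cluster using a simple one-dimensional two-means procedure, loosely analogous to separating fixations from saccades. This is not the method of Arapakis et al. [2]; the `(t, x, y)` sample format and all names are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (time in seconds, x in pixels, y in pixels)


def sample_speeds(samples: Sequence[Sample]) -> List[float]:
    """Speed in pixels per second between consecutive samples."""
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


def two_means(values: Sequence[float], max_iter: int = 100) -> List[int]:
    """Label each value 0 (slow cluster) or 1 (fast cluster)."""
    if not values:
        return []
    low, high = min(values), max(values)  # initial cluster centres
    labels = [0] * len(values)
    for _ in range(max_iter):
        new_labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        slow = [v for v, label in zip(values, new_labels) if label == 0]
        fast = [v for v, label in zip(values, new_labels) if label == 1]
        new_low = sum(slow) / len(slow) if slow else low
        new_high = sum(fast) / len(fast) if fast else high
        if new_labels == labels and new_low == low and new_high == high:
            break  # assignments and centres are stable
        labels, low, high = new_labels, new_low, new_high
    return labels
```

Runs of consecutive `0` labels would then correspond to pause-like segments and runs of `1` labels to rapid movements, which could be summarised per page view and compared against the survey responses.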

There could be several reasons for these negative results
(beyond the typical drawbacks associated with using
questionnaires as discussed above): flawed methodology, a
non-existent signal or use of the wrong measures. In terms of
the methodology, due to the between-groups design of the
experiment participants had no basis of comparison for the
engagement metrics, which could have impacted their answers.
However, performing a within-subjects study where
participants interacted with two variants of the website could have
potentially introduced confounding variables. It could also
be that the experience induced by the ‘ugly’ version of the
interface was not negative enough for users to become
annoyed and thus mark it down. However, experimenting
with an uglier interface would not be valid (unless the aim
is to test this hypothesis), as measuring user engagement
only makes sense if the experience is in principle positive.
Although usability is an important characteristic of user
engagement, usability issues must be fixed first before thinking
about user engagement measurements.

In hindsight, this study demonstrates that designing
experiments to obtain reliable insights about user engagement
and its measurement remains highly challenging. Finally,
not finding a signal may simply mean that some of the
metrics used were not the correct ones. This is in fact what we
believe is happening here. The cursor metrics were not the
right ones to differentiate between the levels of engagement
experience, which is supported by recent work by Arapakis
et al. [2], which showed that more complex mouse metrics,
based on mouse gestures, did correlate with the focussed
attention and affect engagement metrics.


7. CONCLUSION 

The results of this study revealed how a user’s activity
(searching for information or reading) and their cursor
hardware (mouse or trackpad) can be predicted purely from
cursor tracking data. While no conclusive evidence was found
linking participant engagement to cursor data, there are still 
several interesting paths for future work such as evaluating 
further cursor data metrics and linking these results with 
traditional web analytic data. The 5 simple measurements 
evaluated in this study have allowed detection of user task 
and input hardware; yet it is expected that there will be a 
much greater range of cursor metrics not yet evaluated which 
can provide a deeper understanding of the user (as shown in
recent work [2] in the context of user engagement self-report
measures [13]). Future work should focus on exploring other
metrics. Another future direction of this work would be 
to combine it with existing website analytics such as ‘page 
views’ and ‘time spent on a page’ to improve the reliability 
of these models. More work should be carried out to
evaluate whether the input hardware identification model
presented here can be extended to other types of input hardware
such as touchscreens.

This work adds to a growing body of research that
suggests that tracking cursor movements might lead to new
techniques that would allow powerful and inexpensive user
profiling and usability testing. It is hoped that this work
will form the basis of a better understanding of how we use
the cursor to interact, and how this information can be used
to improve website designs and services.